enhance: add per-category score to MultiRiskGraniteGuardianTool via logprobs by NISH1001 · Pull Request #418 · NASA-IMPACT/akd-core

NISH1001 · 2026-04-13T21:47:13Z

What

Added score_threshold config field to MultiRiskGraniteGuardianToolConfig to optionally drop low-confidence detections and reduce false positives. Defaults to 0.0 (no filter) to preserve current behavior.
Each entry in risk_results now includes a per-category score in [0, 1], derived from the logprob of the category's first emitted token in Step 2.
Categories whose per-category score is below score_threshold are dropped from detected_risks and risk_results. Categories without a score (e.g., when Ollama does not return logprobs) pass through unfiltered as a graceful fallback.
Step 1 decision remains label-based (Yes/No) as per the model card; no extra Ollama calls are added.
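
The filtering rule described above could look roughly like this. It is a minimal sketch, not the code in granite_guardian.py: `filter_by_score` and the plain-string category keys are hypothetical names chosen for illustration, and the example scores reuse figures quoted in this PR.

```python
from typing import Optional


def filter_by_score(
    category_scores: dict[str, Optional[float]],
    score_threshold: float = 0.0,
) -> dict[str, dict]:
    """Keep categories at or above the threshold; unscored ones pass through."""
    risk_results: dict[str, dict] = {}
    for category, score in category_scores.items():
        # A missing score (e.g. no logprobs returned) is never filtered out.
        if score is not None and score < score_threshold:
            continue
        risk_results[category] = {"is_risky": True, "score": score}
    return risk_results


# Hypothetical Step 2 output: two scored categories and one without a score.
scores = {"Violence": 0.97, "Harmful": 0.35, "Unethical Behavior": None}
kept = filter_by_score(scores, score_threshold=0.5)
detected_risks = list(kept)
print(detected_risks)
```

With the default threshold of 0.0 nothing is dropped, since every score lies in [0, 1].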

Why

The multi-harm model's self-reported text confidence (e.g., "High", "Not Harmful") is effectively binary in practice, so it cannot be used for thresholding. False positives on some categories (e.g., Harmful flagged on sarcasm) could not be filtered without manually excluding the category. Per-category logprob-derived scores give callers a numeric signal to threshold on (e.g., Violence=0.97 on a violent input vs Harmful=0.35 on sarcasm).

Scoring methodology (first-token logprob)

We use the first token's logprob as the per-category confidence score. Here's how it works with an example:

Step 2 model output: Violence, Unethical Behavior, Harmful
Ollama returns per-token logprobs for the generated text:

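| Token | logprob |
| --- | --- |
| `Violence` | -0.02 |
| `,` | -0.01 |
| (whitespace) | -0.00 |
| `Un` | -0.70 |
| `eth` | -0.03 |
| `ical` | -0.01 |
| `Behavior` | -0.02 |
| `,` | -0.01 |
| (whitespace) | -0.00 |
| `Harm` | -1.20 |
| `ful` | -0.01 |

We walk the token stream, skip commas/whitespace, and grab the first substantive token per category.
Score = exp(first_token_logprob):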
- Violence → exp(-0.02) = 0.98 (1 token, first token = full category)
- Unethical Behavior → exp(-0.70) = 0.50 (first token = Un)
- Harmful → exp(-1.20) = 0.30 (first token = Harm)
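
The walk over the token stream can be sketched as follows. This is an illustration under assumptions, not the PR's `_first_token_logprob_per_category`: the function name, the `(text, logprob)` tuple shape, and the leading space on `" Behavior"` are all hypothetical. The logprob values are the ones from the example above.

```python
import math
from typing import Optional


def first_token_scores(tokens: list[tuple[str, float]]) -> dict[str, float]:
    """Map each comma-separated category to exp(logprob of its first token)."""
    scores: dict[str, float] = {}
    name = ""
    first_logprob: Optional[float] = None
    # A trailing comma sentinel flushes the last category.
    for text, logprob in [*tokens, (",", 0.0)]:
        if text.strip() == ",":
            if first_logprob is not None:
                scores[name.strip()] = math.exp(first_logprob)
            name, first_logprob = "", None
            continue
        name += text
        # Whitespace-only tokens never count as the first substantive token.
        if first_logprob is None and text.strip():
            first_logprob = logprob
    return scores


tokens = [
    ("Violence", -0.02), (",", -0.01), (" ", -0.00),
    ("Un", -0.70), ("eth", -0.03), ("ical", -0.01), (" Behavior", -0.02),
    (",", -0.01), (" ", -0.00),
    ("Harm", -1.20), ("ful", -0.01),
]
result = first_token_scores(tokens)
print({name: round(score, 2) for name, score in result.items()})
```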

Why first-token only (not joint probability across all tokens)?

The first token is the model's decision point — when it emits Un, it has already committed to Unethical Behavior. Subsequent tokens (eth, ical, Behavior) are near-deterministic completions (logprobs ~0) that reflect spelling ability, not category confidence.
Joint probability (product of all token probs) introduces length bias: each extra token multiplies in another probability of at most 1, so longer category names score lower at the same per-token confidence. For example, Violence (1 token) would outscore Unethical Behavior (4 tokens) even if the model is equally confident in both.
Geometric mean (Nth root of the joint probability) corrects the length bias and may be worth exploring, but it is likely not worth the complexity: the first token already captures the model's category selection confidence, and subsequent tokens are near-deterministic completions.
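
To make the comparison concrete, the three candidate scores can be computed side by side. This is a small sketch with hypothetical helper names. The `one_token` and `four_tokens` lists are made-up inputs with equal per-token confidence, while `unethical` reuses the logprobs from the example table.

```python
import math


def joint_probability(logprobs: list[float]) -> float:
    """Product of all token probabilities."""
    return math.exp(sum(logprobs))


def geometric_mean(logprobs: list[float]) -> float:
    """Nth root of the joint probability."""
    return math.exp(sum(logprobs) / len(logprobs))


def first_token(logprobs: list[float]) -> float:
    """Probability of the first token only."""
    return math.exp(logprobs[0])


# Equal per-token confidence, different lengths (hypothetical values).
one_token = [-0.1]
four_tokens = [-0.1, -0.1, -0.1, -0.1]
print(round(joint_probability(one_token), 3), round(joint_probability(four_tokens), 3))
print(round(geometric_mean(one_token), 3), round(geometric_mean(four_tokens), 3))

# "Unethical Behavior" from the example table: Un, eth, ical, Behavior.
unethical = [-0.70, -0.03, -0.01, -0.02]
print(round(joint_probability(unethical), 2))
print(round(geometric_mean(unethical), 2))
print(round(first_token(unethical), 2))
```

On the table values, the geometric mean comes out higher than the first-token score, because the near-deterministic completion tokens are averaged in with the decision token.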

How

Changes are in akd/guardrails/providers/granite_guardian.py
Added score_threshold config (0.0–1.0, default 0.0) to optionally filter low-confidence categories
Step 2 Ollama request now includes logprobs: True and top_logprobs: 5
Category parsing returns a {category: score} dict; score filtering and risk_results building happen in _arun
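
For reference, a Step 2 request body with these fields might be built as below. This is a hedged sketch: `build_category_request`, the model tag, the prompt, `stream: False`, and the `temperature` option are assumptions for illustration. Only the top-level `logprobs` and `top_logprobs` fields come from this PR.

```python
import json


def build_category_request(model: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate (illustrative only)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # assumption: one non-streamed response
        # Per the PR, these go at the top level, not inside "options".
        "logprobs": True,
        "top_logprobs": 5,
        "options": {"temperature": 0.0},  # assumption, not taken from the PR
    }


# The model tag and prompt are placeholders, not the tool's real values.
payload = build_category_request("multi-harm-model", "Classify the risks in: ...")
print(json.dumps(payload, indent=2))
```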

How to test

uv run pytest tests/guardrails/ — existing tests still pass (8 failures in test_granite_think.py are pre-existing and require a live granite3.3-guardian:8b model).
uv run python scripts/test_multi_harm.py with live Ollama + granite-guardian-3.2-5b-multi-harm-GGUF — verifies per-category scores appear in risk_results (e.g., Violence=0.97, Unethical Behavior=0.76 for the same violent input; Harmful=0.35 on sarcasm, correctly flagged as low-confidence).

…probs

### What changes I have done

- Added `score_threshold` config field to `MultiRiskGraniteGuardianToolConfig` to optionally drop low-confidence detections and reduce false positives. Defaults to `0.0` (no filter) to preserve current behavior.
- Each entry in `risk_results` now includes a per-category `score` in [0, 1], derived from the logprob of the category's first emitted token in Step 2.
- Categories whose per-category score is below `score_threshold` are dropped from `detected_risks` and `risk_results`. Categories without a score (e.g., when Ollama does not return logprobs) pass through unfiltered as a graceful fallback.
- Step 1 decision remains label-based (`Yes`/`No`) as per the model card; no extra Ollama calls are added.

### Why

The multi-harm model's self-reported text confidence (e.g., `"High"`, `"Not Harmful"`) is effectively binary in practice, so it cannot be used for thresholding. False positives on some categories (e.g., `Harmful` flagged on sarcasm) could not be filtered without manually excluding the category. Per-category logprob-derived scores give callers a numeric signal to threshold on (e.g., `Violence=0.97` on a violent input vs `Harmful=0.35` on sarcasm).

### How I made the changes

- `akd/guardrails/providers/granite_guardian.py`:
  - `MultiRiskGraniteGuardianToolConfig`: added `score_threshold: float = 0.0` with `ge=0.0, le=1.0` validation.
  - `_call_category_detection`: added top-level `logprobs: True` and `top_logprobs: 5` to the Ollama `/api/generate` request body (Ollama accepts these as top-level params, not inside `options`).
  - `_parse_categories_with_scores`: new helper that parses comma-separated categories and computes per-category scores as `exp(first_token_logprob)`.
  - `_first_token_logprob_per_category`: new static helper that walks the token stream, skipping whitespace/commas, and returns the logprob of the first token of each emitted category.
  - `_parse_categories`: kept as a thin wrapper over `_parse_categories_with_scores` for backward compatibility.
  - `_arun`: applies `score_threshold` as a filter in the Step 2 detected-categories list comprehension; builds `risk_results` with `{"is_risky": True, "score": <float|None>}` per category.

### How to test

- `uv run pytest tests/guardrails/`: existing tests still pass (8 failures in `test_granite_think.py` are pre-existing and require a live `granite3.3-guardian:8b` model).
- `uv run python scripts/test_multi_harm.py` with live Ollama + `granite-guardian-3.2-5b-multi-harm-GGUF`: verifies per-category scores appear in `risk_results` (e.g., Violence=0.97, Unethical Behavior=0.76 for the same violent input; Harmful=0.35 on sarcasm, correctly flagged as low-confidence).

github-actions · 2026-04-13T21:55:48Z

❌ Tests failed (exit code: 1)

📊 Test Results

Passed: 579
Failed: 2
Skipped: 39
Warnings: 183
Coverage: 76%

Branch: feature/logprobs-granite
PR: #418
Commit: 0147d53

📋 Full coverage report and logs are available in the workflow run.

### What

- Merged `_parse_categories` and `_parse_categories_with_scores` into a single `_parse_categories` method that returns `dict[GraniteHarmCategory, float | None]` (category -> per-category score).
- Removed the redundant `"scores"` key from the Step 2 return dict; `"categories"` now holds both the categories and their scores as a dict mapping.
- `unfiltered_categories` in `extra` now preserves per-category scores alongside the category list (previously was a bare list).

### Why

The previous split had `_parse_categories` as a thin list-returning wrapper over `_parse_categories_with_scores` purely for backward compatibility, but `_parse_categories` was only called internally and had no external consumers, so it was dead weight. A single dict return is also a more natural fit: `risk_results` is already a dict of category -> metadata, and downstream consumers need both iteration and score lookup. Dicts preserve insertion order in Python 3.7+, so the model's emission order is kept.

### How

- `akd/guardrails/providers/granite_guardian.py`:
  - `_parse_categories`: now takes optional `token_logprobs`, returns `dict[GraniteHarmCategory, float | None]`. When `token_logprobs` is None/empty, scores are `None` (same behavior as the pre-logprobs version, just wrapped in dict keys instead of a list).
  - `_call_category_detection`: returns `{"categories": <dict>, "raw_response": ...}` (removed the separate `"scores"` key).
  - `_arun`: renamed local from `per_category_scores` to `category_scores`; iterates `category_scores.items()` directly in the filter comprehension; passes the whole `category_scores` dict as `extra["unfiltered_categories"]` to preserve scores in observability output.

### How to test

- `uv run pytest tests/guardrails/ --ignore=tests/guardrails/test_granite_think.py`: all 21 tests pass.
- `uv run python scripts/test_multi_harm.py` with live Ollama: confirms `risk_results` still has `score` per category and `extra["unfiltered_categories"]` now includes scores.
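
The shape of the merged return value can be illustrated with a simplified stand-in. This is a sketch under assumptions, not the real `_parse_categories`: it uses plain strings instead of `GraniteHarmCategory`, and it takes one precomputed first-token logprob per category (`first_token_logprobs`, a hypothetical parameter) instead of walking raw `token_logprobs`.

```python
import math
from typing import Optional


def parse_categories(
    response_text: str,
    first_token_logprobs: Optional[list[float]] = None,
) -> dict[str, Optional[float]]:
    """Return {category: score}; scores are None when no logprobs are given."""
    names = [part.strip() for part in response_text.split(",") if part.strip()]
    logprobs = first_token_logprobs or []
    return {
        # Categories without a matching logprob keep a None score.
        name: math.exp(logprobs[i]) if i < len(logprobs) else None
        for i, name in enumerate(names)
    }


category_scores = parse_categories(
    "Violence, Unethical Behavior, Harmful", [-0.02, -0.70, -1.20]
)
# One dict supports ordered iteration and score lookup.
extra = {"unfiltered_categories": category_scores}
print(list(category_scores))
print(parse_categories("Violence, Harmful"))
```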

github-actions · 2026-04-14T00:35:43Z

❌ Tests failed (exit code: 1)

📊 Test Results

Passed: 579
Failed: 2
Skipped: 39
Warnings: 186
Coverage: 76%

Branch: feature/logprobs-granite
PR: #418
Commit: 5743d76

📋 Full coverage report and logs are available in the workflow run.

muthukumaranR

I think we need to figure out if there is a MultiHarm finetuning pipeline so we are not tied to 3.2. It's already outdated by a few versions.

cc: @jbrry

NISH1001 · 2026-04-14T16:14:17Z

> I think we need to figure out if there is a MultiHarm finetuning pipeline so we are not tied to 3.2. It's already outdated by a few versions.
>
> cc: @jbrry

True. We should probably figure that out soon with LoRA.

NISH1001 temporarily deployed to integration April 13, 2026 21:47 — with GitHub Actions Inactive

NISH1001 requested a review from muthukumaranR April 13, 2026 21:47

NISH1001 commented Apr 13, 2026

Comment thread akd/guardrails/providers/granite_guardian.py Outdated

NISH1001 temporarily deployed to integration April 14, 2026 00:27 — with GitHub Actions Inactive

muthukumaranR reviewed Apr 14, 2026

Comment thread akd/guardrails/providers/granite_guardian.py

muthukumaranR reviewed Apr 14, 2026

muthukumaranR approved these changes Apr 14, 2026

NISH1001 merged commit 01bf23a into develop Apr 14, 2026
1 check passed

NISH1001 deleted the feature/logprobs-granite branch April 14, 2026 16:36

NISH1001 merged 2 commits into develop from feature/logprobs-granite
